102 research outputs found
Matching Subsequences in Trees
Given two rooted, labeled trees $P$ and $T$, the tree path subsequence problem
is to determine which paths in $P$ are subsequences of which paths in $T$. Here
a path begins at the root and ends at a leaf. In this paper we propose this
problem as a useful query primitive for XML data, and provide new algorithms
improving the previously best known time and space bounds. Comment: Minor correction of typos, et
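To make the problem statement concrete, here is a naive baseline sketch in Python. It is not the algorithm from the paper: it simply enumerates every root-to-leaf path in both trees and tests each pair. The `(label, children)` tree encoding and all function names are assumptions made for this illustration.

```python
from typing import List, Tuple

# Assumed encoding for this sketch: a tree is a (label, children) pair.
Tree = Tuple[str, list]


def root_to_leaf_paths(tree: Tree) -> List[List[str]]:
    """Return the label sequence of every root-to-leaf path."""
    label, children = tree
    if not children:
        return [[label]]
    return [[label] + rest
            for child in children
            for rest in root_to_leaf_paths(child)]


def is_subsequence(short: List[str], long: List[str]) -> bool:
    """True if `short` can be obtained from `long` by deleting labels."""
    it = iter(long)
    # `label in it` advances the iterator, so order is respected.
    return all(label in it for label in short)


def path_subsequences(first: Tree, second: Tree) -> List[Tuple[int, int]]:
    """Pairs (i, j): path i of `first` is a subsequence of path j of `second`."""
    first_paths = root_to_leaf_paths(first)
    second_paths = root_to_leaf_paths(second)
    return [(i, j)
            for i, fp in enumerate(first_paths)
            for j, sp in enumerate(second_paths)
            if is_subsequence(fp, sp)]
```

This brute-force version takes time proportional to the total length of all path pairs, which is the kind of cost the improved bounds are meant to avoid.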
Structured Text Retrieval Models
Structured text retrieval models provide a formal definition or mathematical framework for querying semistructured textual databases. A textual database contains both content and structure. The content is the text itself, and the structure divides the database into separate textual parts and relates those textual parts by some criterion. Often, textual databases can be represented as marked up text, for instance as XML, where the XML elements define the structure on the text content. Retrieval models for textual databases should comprise three parts: 1) a model of the text, 2) a model of the structure, and 3) a query language [4]: The model of the text defines a tokenization into words or other semantic units, as well as stop words, stemming, synonyms, etc. The model of the structure defines parts of the text, typically a contiguous portion of the text called element, region, or segment, which is defined on top of the text model’s word tokens. The query language typically defines a number of operators on content and structure such as set operators and operators like “containing” and “contained-by” to model relations between content and structure, as well as relations between the structural elements themselves. Using such a query language, the (expert) user can for instance formulate requests like “I want a paragraph discussing formal models near to a table discussing the differences between databases and information retrieval”. Here, “formal models” and “differences between databases and information retrieval” should match the content that needs to be retrieved from the database, whereas “paragraph” and “table” refer to structural constraints on the units to retrieve. The features, structuring power, and the expressiveness of the query languages of several models for structured text retrieval are discussed below.
HISTORICAL BACKGROUND The STAIRS system (Storage and Information Retrieval System), which was developed at IBM already in the late 1950’s, allowed querying both content and structure. Much like today’s On-line Public Access Catalogues, it wa
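The containment operators mentioned above can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that both structural elements and content matches are represented as inclusive `(start, end)` token offsets; the function names and this representation are hypothetical and not taken from any particular retrieval model.

```python
from typing import List, Tuple

# Assumed representation: a region is an inclusive (start, end) span of token offsets.
Region = Tuple[int, int]


def _covers(outer: Region, inner: Region) -> bool:
    """True if `inner` lies entirely within `outer`."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def containing(outer: List[Region], inner: List[Region]) -> List[Region]:
    """Regions of `outer` that contain at least one region of `inner`."""
    return [a for a in outer if any(_covers(a, b) for b in inner)]


def contained_by(inner: List[Region], outer: List[Region]) -> List[Region]:
    """Regions of `inner` that lie inside at least one region of `outer`."""
    return [b for b in inner if any(_covers(a, b) for a in outer)]
```

For example, with paragraph regions `[(0, 9), (10, 19)]` and a phrase match at `(12, 13)`, `containing` selects the second paragraph, while `contained_by` selects the match itself.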
Fast Searching in Packed Strings
Given strings $P$ and $Q$, the (exact) string matching problem is to find all
positions of substrings in $Q$ matching $P$. The classical Knuth-Morris-Pratt
algorithm [SIAM J. Comput., 1977] solves the string matching problem in linear
time which is optimal if we can only read one character at a time. However,
most strings are stored in a computer in a packed representation with several
characters in a single word, giving us the opportunity to read multiple
characters simultaneously. In this paper we study the worst-case complexity of
string matching on strings given in packed representation. Let $m \leq n$ be
the lengths of $P$ and $Q$, respectively, and let $\sigma$ denote the size of the
alphabet. On a standard unit-cost word-RAM with logarithmic word size we
present an algorithm using time
$O\left(\frac{n}{\log_\sigma n} + m + \mathrm{occ}\right)$.
Here $\mathrm{occ}$ is the number of occurrences of $P$ in $Q$. For $m = o(n)$ this improves the $O(n)$ bound of the Knuth-Morris-Pratt algorithm.
Furthermore, if $m = O(n/\log_\sigma n)$ our algorithm is optimal since any
algorithm must spend at least
$\Omega\left(\frac{(n+m)\log \sigma}{\log n} + \mathrm{occ}\right) = \Omega\left(\frac{n}{\log_\sigma n} + \mathrm{occ}\right)$ time to
read the input and report all occurrences. The result is obtained by a novel
automaton construction based on the Knuth-Morris-Pratt algorithm combined with
a new compact representation of subautomata allowing an optimal
tabulation-based simulation. Comment: To appear in Journal of Discrete Algorithms. Special Issue on CPM
200
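For reference, the classical Knuth-Morris-Pratt algorithm that the abstract uses as its baseline can be sketched in Python as follows. This is the standard one-character-at-a-time version, not the packed-string algorithm of the paper; the function names are chosen for this illustration.

```python
from typing import List


def failure_function(pattern: str) -> List[int]:
    """fail[i] = length of the longest proper border of pattern[:i + 1]."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_search(pattern: str, text: str) -> List[int]:
    """Return the 0-based start positions of all occurrences of pattern in text."""
    if not pattern:
        return list(range(len(text) + 1))
    fail = failure_function(pattern)
    hits = []
    k = 0  # number of pattern characters currently matched
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            hits.append(i - k + 1)
            k = fail[k - 1]  # continue, allowing overlapping occurrences
    return hits
```

Each text character is read exactly once, which is where the linear bound comes from and why reading several packed characters per word can do better.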
Revisiting the Problem of Searching on a Line
We revisit the problem of searching for a target at an unknown location on a
line when given upper and lower bounds on the distance D that separates the
initial position of the searcher from the target. Prior to this work, only
asymptotic bounds were known for the optimal competitive ratio achievable by
any search strategy in the worst case. We present the first tight bounds on the
exact optimal competitive ratio achievable, parameterized in terms of the given
bounds on D, along with an optimal search strategy that achieves this
competitive ratio. We prove that this optimal strategy is unique. We
characterize the conditions under which an optimal strategy can be computed
exactly and, when it cannot, we explain how numerical methods can be used
efficiently. In addition, we answer several related open questions, including
the maximal reach problem, and we discuss how to generalize these results to m
rays, for any m >= 2
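As background, the classical doubling strategy for searching on a line can be sketched in Python. This is the textbook strategy, not the optimal strategy derived in the paper. The sketch assumes the searcher starts at position 0, the target is at distance at least 1, and the first excursion goes in the positive direction; the function name is hypothetical.

```python
def doubling_search_cost(target: float) -> float:
    """Distance walked by the doubling strategy before reaching `target`.

    Assumes the searcher starts at 0 and |target| >= 1 (a lower bound on D).
    Excursions alternate sides with reach 1, 2, 4, ..., returning to 0 each time.
    """
    if abs(target) < 1:
        raise ValueError("this sketch assumes |target| >= 1")
    walked = 0.0
    reach = 1.0
    direction = 1  # +1 for the positive side, -1 for the negative side
    while True:
        on_this_side = target * direction > 0
        if on_this_side and abs(target) <= reach:
            return walked + abs(target)
        walked += 2 * reach  # walk out to `reach` and back to the origin
        reach *= 2
        direction = -direction
```

Dividing the returned cost by `abs(target)` gives the competitive ratio for that instance; for this strategy the ratio stays below 9.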
Faster Approximate String Matching for Short Patterns
We study the classical approximate string matching problem, that is, given
strings $P$ and $Q$ and an error threshold $k$, find all ending positions of
substrings of $Q$ whose edit distance to $P$ is at most $k$. Let $P$ and $Q$
have lengths $m$ and $n$, respectively. On a standard unit-cost word RAM with
word size $w \geq \log n$ we present an algorithm using time
$O\left(nk \cdot \min\left(\frac{\log^2 m}{\log n}, \frac{\log^2 m \log w}{w}\right) + n\right)$. When $P$ is
short, namely, $m = 2^{o(\sqrt{\log n})}$ or $m = 2^{o(\sqrt{w/\log w})}$, this
improves the previously best known time bounds for the problem. The result is
achieved using a novel implementation of the Landau-Vishkin algorithm based on
tabulation and word-level parallelism. Comment: To appear in Theory of Computing System
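The problem itself can be illustrated with the standard dynamic-programming solution, sketched here in Python. This is the simple quadratic-time baseline, not the Landau-Vishkin algorithm or the tabulation-based method of the paper; the function name and the 1-based reporting of ending positions are choices made for this sketch.

```python
from typing import List


def approx_match_end_positions(pattern: str, text: str, k: int) -> List[int]:
    """1-based ending positions j such that some substring of `text` ending
    at j has edit distance at most k to `pattern`."""
    m = len(pattern)
    # col[i] = min edit distance between pattern[:i] and a substring of the
    # text ending at the current position; initially the text prefix is empty.
    col = list(range(m + 1))
    ends = []
    for j, ch in enumerate(text, start=1):
        prev_diag = col[0]
        col[0] = 0  # a match may start at any text position
        for i in range(1, m + 1):
            cur = min(col[i] + 1,                         # extra text character
                      col[i - 1] + 1,                     # missing pattern character
                      prev_diag + (pattern[i - 1] != ch))  # match or substitution
            prev_diag = col[i]
            col[i] = cur
        if col[m] <= k:
            ends.append(j)
    return ends
```

The sketch keeps a single column of the table, so it uses space linear in the pattern length and time proportional to the product of the two string lengths.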
Suffix Tree of Alignment: An Efficient Index for Similar Data
We consider an index data structure for similar strings. The generalized
suffix tree can be a solution for this. The generalized suffix tree of two
strings $A$ and $B$ is a compacted trie representing all suffixes in $A$ and
$B$. It has $|A| + |B|$ leaves and can be constructed in $O(|A| + |B|)$ time.
However, if the two strings are similar, the generalized suffix tree is not
efficient because it does not exploit the similarity which is usually
represented as an alignment of $A$ and $B$.
In this paper we propose a space/time-efficient suffix tree of alignment
which wisely exploits the similarity in an alignment. Our suffix tree for an
alignment of $A$ and $B$ has $|A| + l_d + l_1$ leaves where $l_d$ is the sum of
the lengths of all parts of $B$ different from $A$ and $l_1$ is the sum of the
lengths of some common parts of $A$ and $B$. We did not compromise the pattern
search to reduce the space. Our suffix tree can be searched for a pattern $P$
in $O(|P| + occ)$ time where $occ$ is the number of occurrences of $P$ in $A$ and
$B$. We also present an efficient algorithm to construct the suffix tree of
alignment. When the suffix tree is constructed from scratch, the algorithm
requires $O(|A| + l_d + l_1 + l_2)$ time where $l_2$ is the sum of the lengths
of other common substrings of $A$ and $B$. When the suffix tree of $A$ is
already given, it requires $O(l_d + l_1 + l_2)$ time. Comment: 12 page
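To illustrate what a generalized index over two strings provides, here is a naive Python sketch using an uncompacted suffix trie. It is neither a compacted suffix tree nor the suffix tree of alignment proposed in the paper, and its construction is quadratic; the nested-dictionary layout, the `None` key used to record suffix ends, and the function names are all assumptions of this sketch.

```python
from typing import Dict, List, Tuple


def build_generalized_suffix_trie(first: str, second: str) -> Dict:
    """Insert every suffix of both strings into one trie of nested dicts.

    The key None at a node holds the (string_id, start) pairs of the
    suffixes that end exactly at that node.
    """
    root: Dict = {}
    for string_id, s in enumerate((first, second)):
        for start in range(len(s)):
            node = root
            for ch in s[start:]:
                node = node.setdefault(ch, {})
            node.setdefault(None, []).append((string_id, start))
    return root


def occurrences(trie: Dict, pattern: str) -> List[Tuple[int, int]]:
    """All (string_id, start) pairs where `pattern` occurs, in sorted order."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return []
        node = node[ch]
    found = []
    stack = [node]
    while stack:  # every suffix below this node starts with the pattern
        cur = stack.pop()
        for key, child in cur.items():
            if key is None:
                found.extend(child)
            else:
                stack.append(child)
    return sorted(found)
```

The trie stores one suffix end per suffix of each string regardless of how similar the two strings are, which is the redundancy an alignment-aware index aims to remove.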
Efficient exact pattern-matching in proteomic sequences
This paper proposes a novel algorithm for complete exact pattern-matching that focuses on the specificities of protein sequences (an alphabet of 20 symbols) but is also highly efficient for larger alphabets. The searching strategy uses large search windows, allowing multiple alignments per iteration. A new filtering heuristic, named the compatibility rule, contributes decisively to the efficiency improvement. The new algorithm’s performance is, on average, superior to that of its best-rated competitors